Web Scraping in Python Series - Scraping CSO Data

Introduction

In this example we will work with data that has been made available for public use. Extracting information from other websites is commonly done to present useful insights or to offer "value-added" services. Here we would like to evaluate how commodity prices change over time, using reports generated by the Trinidad and Tobago government.

We will be using information published by the Ministry of Trade and Industry relating to "Supermarket Prices of Food Items in Trinidad and Tobago". The booklet reports prices for supermarkets across several districts, which we describe in the meta dictionary below.

To extract the tables from the PDF we will be using the popular utility called Tabula. However, Tabula is written in Java while our system is written in Python, so we will use tabula-py, a Python package that wraps the Java library (a Java runtime must therefore be installed). Install it with either:

pip install tabula-py

or

pip install git+https://github.com/kyledef/tabula-py.git
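Because tabula-py shells out to the Java Tabula jar, a `java` executable must be on the PATH. A quick sanity check before running the extraction; the helper name below is ours, not part of tabula-py:

```python
import shutil

def java_available():
    """Return True if a `java` executable is on the PATH (required by tabula-py)."""
    return shutil.which("java") is not None

if not java_available():
    print("Java runtime not found - install Java before using tabula-py")
```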

In [ ]:
import os
import tabula

In [ ]:
dir_path = os.getcwd()
pgs = list(range(3, 28))  # the price tables span pages 3 through 27 of the booklet

In [ ]:
meta = {}
# Section 1
meta['arima'] = ['xtra foods', 'massy stores']
meta['barataria'] = ['food giant', 'jumbo foods']
meta['chaguanas'] = ['price club', 'xtra foods']
meta['couva'] = ['cash & carry', 'toolsies']
meta['cunupia'] = ['low cost', 'one plus one']

# Section 2
meta['curepe'] = ['massy stores', 'tru valu']
meta['debe'] = ['ms food city', 'g & n']
meta['diego martin'] = ['tru valu', 'massy stores']
meta['mayaro'] = ['s & s persad', 'persard d food king']
meta['point fortin'] = ['peoping', "persad's"]
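As a quick sanity check on the lookup table above, we can invert it to see which districts carry a given chain. A minimal sketch, shown here with a small subset of the dictionary so it stands alone:

```python
# Subset of the district -> supermarkets lookup table from the notebook.
meta = {
    'arima': ['xtra foods', 'massy stores'],
    'curepe': ['massy stores', 'tru valu'],
    'diego martin': ['tru valu', 'massy stores'],
}

# Build the reverse index: supermarket chain -> districts where it appears.
chains = {}
for district, stores in meta.items():
    for store in stores:
        chains.setdefault(store, []).append(district)

print(chains['massy stores'])  # -> ['arima', 'curepe', 'diego martin']
```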

In [ ]:
# Path to the December 2016 price booklet stored under docs/pdfs/
pdf = "{0}/docs/pdfs/Supermarket-Prices-BookletDec2016.pdf".format(dir_path)
# Extract the tables on the selected pages into a single CSV file
tabula.convert_into(pdf, "output.csv", output_format="csv", pages=pgs, spreadsheet=False)
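Once output.csv has been produced, the extracted rows usually need some cleaning before analysis. A minimal sketch using the standard csv module; the sample rows and the (item, unit, price) column layout are hypothetical, since the real layout depends on how Tabula splits the booklet's tables:

```python
import csv
import io

# Hypothetical sample of what Tabula's CSV output might look like.
sample = io.StringIO(
    "ITEM,UNIT,PRICE\n"
    "Rice,1 kg,$10.50\n"
    "Flour,2 kg,$8.00\n"
)

rows = []
for item, unit, price in csv.reader(sample):
    if price == "PRICE":  # skip the header row
        continue
    # Strip the currency symbol and convert to a float for analysis.
    rows.append((item, unit, float(price.lstrip("$"))))

print(rows)  # -> [('Rice', '1 kg', 10.5), ('Flour', '2 kg', 8.0)]
```

In practice you would open the real file with `open("output.csv")` in place of the StringIO sample.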

In [ ]: